This is my Exam 3 document

Let's load the data and take a look at it.

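library(tidyverse)  # ggplot2, dplyr, tidyr
library(plotly)     # ggplotly()
library(modelr)     # add_predictions()
data <- read.csv(file = "BioLogData_Exam3.csv", sep = "|", stringsAsFactors = TRUE)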
summary(data)
##        Sample.ID        Rep         Well        Dilution    
##  Clear_Creek:288   Min.   :1   A1     : 36   Min.   :0.001  
##  Soil_1     :288   1st Qu.:1   A2     : 36   1st Qu.:0.001  
##  Soil_2     :288   Median :2   A3     : 36   Median :0.010  
##  Waste_Water:288   Mean   :2   A4     : 36   Mean   :0.037  
##                    3rd Qu.:3   B1     : 36   3rd Qu.:0.100  
##                    Max.   :3   B2     : 36   Max.   :0.100  
##                                (Other):936                  
##                        Substrate       Hr_24            Hr_48       
##  2-Hydroxy Benzoic Acid     : 36   Min.   :0.0000   Min.   :0.0000  
##  4-Hydroxy Benzoic Acid     : 36   1st Qu.:0.0000   1st Qu.:0.0060  
##  D-Cellobiose               : 36   Median :0.0320   Median :0.2595  
##  D-Galactonic Acid γ-Lactone: 36   Mean   :0.1703   Mean   :0.4691  
##  D-Galacturonic Acid        : 36   3rd Qu.:0.1872   3rd Qu.:0.7220  
##  D-Glucosaminic Acid        : 36   Max.   :2.6500   Max.   :2.7850  
##  (Other)                    :936                                    
##      Hr_144       
##  Min.   :0.00000  
##  1st Qu.:0.04175  
##  Median :0.75200  
##  Mean   :0.92497  
##  3rd Qu.:1.67950  
##  Max.   :3.11600  
## 

Let's do some exploratory analysis.

pairs(data)

class(data$Sample.ID)
## [1] "factor"
class(data$Rep)
## [1] "integer"
class(data$Well)
## [1] "factor"
class(data$Dilution)
## [1] "numeric"
class(data$Substrate)
## [1] "factor"
class(data$Hr_24)
## [1] "numeric"
class(data$Hr_48)
## [1] "numeric"
class(data$Hr_144)
## [1] "numeric"

Some regression models and summary statistics.

a<- lm(formula = Dilution ~ Hr_24, data = data)
summary(a)
## 
## Call:
## lm(formula = Dilution ~ Hr_24, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03664 -0.03607 -0.02750  0.06237  0.06787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.037644   0.001497  25.146   <2e-16 ***
## Hr_24       -0.003784   0.004173  -0.907    0.365    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04472 on 1150 degrees of freedom
## Multiple R-squared:  0.0007146,  Adjusted R-squared:  -0.0001544 
## F-statistic: 0.8223 on 1 and 1150 DF,  p-value: 0.3647
b<- lm(formula = Dilution ~ Hr_48, data = data)
summary(b)
## 
## Call:
## lm(formula = Dilution ~ Hr_48, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03713 -0.03571 -0.02745  0.06198  0.06650 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.038127   0.001710  22.296   <2e-16 ***
## Hr_48       -0.002403   0.002324  -1.034    0.301    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04472 on 1150 degrees of freedom
## Multiple R-squared:  0.0009286,  Adjusted R-squared:  5.981e-05 
## F-statistic: 1.069 on 1 and 1150 DF,  p-value: 0.3014
c<- lm(formula= Dilution ~ Hr_144, data = data)
summary(c)
## 
## Call:
## lm(formula = Dilution ~ Hr_144, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04068 -0.03168 -0.02651  0.05956  0.07303 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.041682   0.001923   21.68  < 2e-16 ***
## Hr_144      -0.005062   0.001520   -3.33 0.000896 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04452 on 1150 degrees of freedom
## Multiple R-squared:  0.00955,    Adjusted R-squared:  0.008689 
## F-statistic: 11.09 on 1 and 1150 DF,  p-value: 0.0008963

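Hr_144 is the only time point with a significant relationship to Dilution (p = 0.0009, versus 0.365 for Hr_24 and 0.301 for Hr_48), although it explains less than 1% of the variance (R-squared = 0.00955).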

hist(data$Dilution)

hist(data$Hr_144)

hist(data$Hr_48)

hist(data$Hr_24)

names(data)
## [1] "Sample.ID" "Rep"       "Well"      "Dilution"  "Substrate" "Hr_24"    
## [7] "Hr_48"     "Hr_144"
# Use bare column names inside aes(); data$... bypasses the facet split
ggplot(data, aes(x = Dilution, y = Substrate)) +
  geom_boxplot() + facet_wrap(~Sample.ID)

fig1 <- ggplot(data, aes(x = Hr_24, fill = Sample.ID)) +
  geom_histogram()



fig2 <- ggplot(data, aes(x = Hr_24, fill = Substrate)) +
  geom_histogram()
ggplotly(fig1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(fig2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
fig3 <- ggplot(data, aes(x = Hr_48, fill = Sample.ID)) +
  geom_histogram()



fig4 <- ggplot(data, aes(x = Hr_48, fill = Substrate)) +
  geom_histogram()




fig5 <- ggplot(data, aes(x = Hr_144, fill = Sample.ID)) +
  geom_histogram()



fig6 <- ggplot(data, aes(x = Hr_144, fill = Substrate)) +
  geom_histogram()

ggplotly(fig3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(fig4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(fig5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(fig6)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

QUESTION 1

Which sample locations are functionally different from each other in terms of what C-substrates they can utilize?

levels(data$Sample.ID)
## [1] "Clear_Creek" "Soil_1"      "Soil_2"      "Waste_Water"
levels(data$Substrate)
##  [1] "2-Hydroxy Benzoic Acid"      "4-Hydroxy Benzoic Acid"     
##  [3] "D-Cellobiose"                "D-Galactonic Acid γ-Lactone"
##  [5] "D-Galacturonic Acid"         "D-Glucosaminic Acid"        
##  [7] "D-Mallic Acid"               "D-Mannitol"                 
##  [9] "D-Xylose"                    "D.L -α-Glycerol Phosphate"  
## [11] "Glucose-1-Phosphate"         "Glycogen"                   
## [13] "Glycyl-L-Glutamic Acid"      "i-Erythitol"                
## [15] "Itaconic Acid"               "L-Arginine"                 
## [17] "L-Asparganine"               "L-Phenylalanine"            
## [19] "L-Serine"                    "L-Threonine"                
## [21] "N-Acetyl-D-Glucosamine"      "Phenylethylamine"           
## [23] "Putrescine"                  "Pyruvic Acid Methyl Ester"  
## [25] "Tween 40"                    "Tween 80 "                  
## [27] "Water"                       "α-Cyclodextrin"             
## [29] "α-D-Lactose"                 "α-Ketobutyric Acid"         
## [31] "β-Methyl-D- Glucoside"       "γ-Hydroxybutyric Acid"

The soil samples are functionally different from the Waste_Water and Clear_Creek samples.

The soil samples appear to stand apart because they utilize more carbon substrates than the waste water and clear creek samples do.

QUESTION 2

Are Soil and Water samples significantly different overall (as in, overall diversity of usable carbon sources)? What about for individual carbon substrates?
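
The code for this question uses objects (`creek`, `wastewater`, `soil1`, `soil2`) and columns (`values`, `inc_hrs`) whose construction is not shown in this document. Below is a minimal sketch of one way they could be built, assuming the wide table was reshaped to long format with tidyr and then split by Sample.ID. The toy data frame `wide` and its numbers are made up for illustration and are not the exam data.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the wide BioLog table (illustration only, not the exam data)
wide <- data.frame(
  Sample.ID = c("Clear_Creek", "Soil_1", "Soil_2", "Waste_Water"),
  Substrate = "L-Serine",
  Hr_24  = c(0.01, 0.10, 0.12, 0.02),
  Hr_48  = c(0.05, 0.40, 0.45, 0.08),
  Hr_144 = c(0.20, 1.50, 1.60, 0.30)
)

# One row per reading: the column name goes to inc_hrs, the reading to values
long <- wide %>%
  pivot_longer(cols = starts_with("Hr_"),
               names_to = "inc_hrs", values_to = "values")

# One data frame per sample location
creek      <- filter(long, Sample.ID == "Clear_Creek")
wastewater <- filter(long, Sample.ID == "Waste_Water")
soil1      <- filter(long, Sample.ID == "Soil_1")
soil2      <- filter(long, Sample.ID == "Soil_2")
```

With the real data, each of the four data frames would hold 864 rows (288 wells times 3 time points), which matches the degrees of freedom reported in the model output.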

creek <- creek %>% 
  mutate(diversity="water")
wastewater <- wastewater %>% 
  mutate(diversity="water")
soil1 <- soil1 %>%
  mutate(diversity="soil")
soil2 <- soil2 %>%
  mutate(diversity="soil")

data<- rbind(creek, wastewater, soil1, soil2)

mod1<- lm(data= data, values ~ Substrate * diversity)
summary(mod1) 
## 
## Call:
## lm(formula = values ~ Substrate * diversity, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3675 -0.3568 -0.1332  0.1984  2.6406 
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                          0.609204   0.082725
## Substrate4-Hydroxy Benzoic Acid                      0.412444   0.116991
## SubstrateD-Cellobiose                                0.409778   0.116991
## SubstrateD-Galactonic Acid γ-Lactone                 0.116667   0.116991
## SubstrateD-Galacturonic Acid                         0.382444   0.116991
## SubstrateD-Glucosaminic Acid                         0.234093   0.116991
## SubstrateD-Mallic Acid                               0.009667   0.116991
## SubstrateD-Mannitol                                  0.631981   0.116991
## SubstrateD-Xylose                                    0.333370   0.116991
## SubstrateD.L -α-Glycerol Phosphate                  -0.420556   0.116991
## SubstrateGlucose-1-Phosphate                         0.015981   0.116991
## SubstrateGlycogen                                    0.242000   0.116991
## SubstrateGlycyl-L-Glutamic Acid                      0.046333   0.116991
## Substratei-Erythitol                                 0.022241   0.116991
## SubstrateItaconic Acid                               0.292037   0.116991
## SubstrateL-Arginine                                  0.432741   0.116991
## SubstrateL-Asparganine                               0.758278   0.116991
## SubstrateL-Phenylalanine                             0.172370   0.116991
## SubstrateL-Serine                                    0.581315   0.116991
## SubstrateL-Threonine                                 0.061593   0.116991
## SubstrateN-Acetyl-D-Glucosamine                      0.595148   0.116991
## SubstratePhenylethylamine                            0.134333   0.116991
## SubstratePutrescine                                 -0.017556   0.116991
## SubstratePyruvic Acid Methyl Ester                   0.391852   0.116991
## SubstrateTween 40                                    0.296056   0.116991
## SubstrateTween 80                                    0.332741   0.116991
## SubstrateWater                                      -0.609204   0.116991
## Substrateα-Cyclodextrin                              0.042630   0.116991
## Substrateα-D-Lactose                                -0.074815   0.116991
## Substrateα-Ketobutyric Acid                         -0.093981   0.116991
## Substrateβ-Methyl-D- Glucoside                       0.056759   0.116991
## Substrateγ-Hydroxybutyric Acid                      -0.213148   0.116991
## diversitywater                                      -0.561426   0.116991
## Substrate4-Hydroxy Benzoic Acid:diversitywater      -0.210352   0.165450
## SubstrateD-Cellobiose:diversitywater                 0.051815   0.165450
## SubstrateD-Galactonic Acid γ-Lactone:diversitywater  0.101000   0.165450
## SubstrateD-Galacturonic Acid:diversitywater         -0.017000   0.165450
## SubstrateD-Glucosaminic Acid:diversitywater         -0.127370   0.165450
## SubstrateD-Mallic Acid:diversitywater                0.107093   0.165450
## SubstrateD-Mannitol:diversitywater                  -0.134852   0.165450
## SubstrateD-Xylose:diversitywater                    -0.332352   0.165450
## SubstrateD.L -α-Glycerol Phosphate:diversitywater    0.469278   0.165450
## SubstrateGlucose-1-Phosphate:diversitywater          0.198574   0.165450
## SubstrateGlycogen:diversitywater                     0.163907   0.165450
## SubstrateGlycyl-L-Glutamic Acid:diversitywater       0.108796   0.165450
## Substratei-Erythitol:diversitywater                  0.160963   0.165450
## SubstrateItaconic Acid:diversitywater               -0.168259   0.165450
## SubstrateL-Arginine:diversitywater                  -0.238370   0.165450
## SubstrateL-Asparganine:diversitywater               -0.404093   0.165450
## SubstrateL-Phenylalanine:diversitywater             -0.018630   0.165450
## SubstrateL-Serine:diversitywater                    -0.310259   0.165450
## SubstrateL-Threonine:diversitywater                  0.133130   0.165450
## SubstrateN-Acetyl-D-Glucosamine:diversitywater      -0.037500   0.165450
## SubstratePhenylethylamine:diversitywater            -0.006741   0.165450
## SubstratePutrescine:diversitywater                   0.118000   0.165450
## SubstratePyruvic Acid Methyl Ester:diversitywater   -0.089815   0.165450
## SubstrateTween 40:diversitywater                    -0.120519   0.165450
## SubstrateTween 80 :diversitywater                    0.112963   0.165450
## SubstrateWater:diversitywater                        0.561426   0.165450
## Substrateα-Cyclodextrin:diversitywater               0.093185   0.165450
## Substrateα-D-Lactose:diversitywater                  0.277833   0.165450
## Substrateα-Ketobutyric Acid:diversitywater           0.079000   0.165450
## Substrateβ-Methyl-D- Glucoside:diversitywater        0.252259   0.165450
## Substrateγ-Hydroxybutyric Acid:diversitywater        0.425648   0.165450
##                                                     t value Pr(>|t|)    
## (Intercept)                                           7.364 2.23e-13 ***
## Substrate4-Hydroxy Benzoic Acid                       3.525 0.000428 ***
## SubstrateD-Cellobiose                                 3.503 0.000467 ***
## SubstrateD-Galactonic Acid γ-Lactone                  0.997 0.318723    
## SubstrateD-Galacturonic Acid                          3.269 0.001090 ** 
## SubstrateD-Glucosaminic Acid                          2.001 0.045477 *  
## SubstrateD-Mallic Acid                                0.083 0.934152    
## SubstrateD-Mannitol                                   5.402 7.04e-08 ***
## SubstrateD-Xylose                                     2.850 0.004405 ** 
## SubstrateD.L -α-Glycerol Phosphate                   -3.595 0.000329 ***
## SubstrateGlucose-1-Phosphate                          0.137 0.891351    
## SubstrateGlycogen                                     2.069 0.038665 *  
## SubstrateGlycyl-L-Glutamic Acid                       0.396 0.692098    
## Substratei-Erythitol                                  0.190 0.849237    
## SubstrateItaconic Acid                                2.496 0.012599 *  
## SubstrateL-Arginine                                   3.699 0.000220 ***
## SubstrateL-Asparganine                                6.482 1.04e-10 ***
## SubstrateL-Phenylalanine                              1.473 0.140744    
## SubstrateL-Serine                                     4.969 7.07e-07 ***
## SubstrateL-Threonine                                  0.526 0.598593    
## SubstrateN-Acetyl-D-Glucosamine                       5.087 3.83e-07 ***
## SubstratePhenylethylamine                             1.148 0.250950    
## SubstratePutrescine                                  -0.150 0.880727    
## SubstratePyruvic Acid Methyl Ester                    3.349 0.000819 ***
## SubstrateTween 40                                     2.531 0.011432 *  
## SubstrateTween 80                                     2.844 0.004479 ** 
## SubstrateWater                                       -5.207 2.03e-07 ***
## Substrateα-Cyclodextrin                               0.364 0.715593    
## Substrateα-D-Lactose                                 -0.639 0.522544    
## Substrateα-Ketobutyric Acid                          -0.803 0.421843    
## Substrateβ-Methyl-D- Glucoside                        0.485 0.627593    
## Substrateγ-Hydroxybutyric Acid                       -1.822 0.068554 .  
## diversitywater                                       -4.799 1.66e-06 ***
## Substrate4-Hydroxy Benzoic Acid:diversitywater       -1.271 0.203675    
## SubstrateD-Cellobiose:diversitywater                  0.313 0.754166    
## SubstrateD-Galactonic Acid γ-Lactone:diversitywater   0.610 0.541600    
## SubstrateD-Galacturonic Acid:diversitywater          -0.103 0.918167    
## SubstrateD-Glucosaminic Acid:diversitywater          -0.770 0.441446    
## SubstrateD-Mallic Acid:diversitywater                 0.647 0.517493    
## SubstrateD-Mannitol:diversitywater                   -0.815 0.415093    
## SubstrateD-Xylose:diversitywater                     -2.009 0.044640 *  
## SubstrateD.L -α-Glycerol Phosphate:diversitywater     2.836 0.004590 ** 
## SubstrateGlucose-1-Phosphate:diversitywater           1.200 0.230142    
## SubstrateGlycogen:diversitywater                      0.991 0.321913    
## SubstrateGlycyl-L-Glutamic Acid:diversitywater        0.658 0.510853    
## Substratei-Erythitol:diversitywater                   0.973 0.330681    
## SubstrateItaconic Acid:diversitywater                -1.017 0.309235    
## SubstrateL-Arginine:diversitywater                   -1.441 0.149750    
## SubstrateL-Asparganine:diversitywater                -2.442 0.014641 *  
## SubstrateL-Phenylalanine:diversitywater              -0.113 0.910354    
## SubstrateL-Serine:diversitywater                     -1.875 0.060844 .  
## SubstrateL-Threonine:diversitywater                   0.805 0.421076    
## SubstrateN-Acetyl-D-Glucosamine:diversitywater       -0.227 0.820706    
## SubstratePhenylethylamine:diversitywater             -0.041 0.967504    
## SubstratePutrescine:diversitywater                    0.713 0.475766    
## SubstratePyruvic Acid Methyl Ester:diversitywater    -0.543 0.587267    
## SubstrateTween 40:diversitywater                     -0.728 0.466400    
## SubstrateTween 80 :diversitywater                     0.683 0.494803    
## SubstrateWater:diversitywater                         3.393 0.000698 ***
## Substrateα-Cyclodextrin:diversitywater                0.563 0.573320    
## Substrateα-D-Lactose:diversitywater                   1.679 0.093193 .  
## Substrateα-Ketobutyric Acid:diversitywater            0.477 0.633046    
## Substrateβ-Methyl-D- Glucoside:diversitywater         1.525 0.127430    
## Substrateγ-Hydroxybutyric Acid:diversitywater         2.573 0.010134 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6079 on 3392 degrees of freedom
## Multiple R-squared:  0.252,  Adjusted R-squared:  0.2381 
## F-statistic: 18.14 on 63 and 3392 DF,  p-value: < 2.2e-16

QUESTION 3

If there are differences between samples, which C-substrates are driving those differences?

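Yes, there are differences. They can be identified from the Substrate:diversity interaction terms that are significant in the above model. At p < 0.05 these are D-Xylose, D.L -α-Glycerol Phosphate, L-Asparganine, and γ-Hydroxybutyric Acid (the Water term is also significant, but it is the negative control).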
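
Rather than scanning a long coefficient table by eye, the significant interaction terms can be pulled out programmatically. This is a hedged sketch on simulated data (the `toy` data frame, its substrate names, and the built-in effect are assumptions for illustration); the same filtering step would apply to `summary(mod1)$coefficients`.

```r
# Simulated stand-in data (not the exam data): three substrates, two sample types
set.seed(1)
toy <- expand.grid(
  Substrate = c("A", "B", "C"),
  diversity = c("soil", "water"),
  rep = 1:10
)
toy$values <- rnorm(nrow(toy), mean = 1, sd = 0.1)

# Build in a soil/water difference for substrate B only
is_bw <- toy$Substrate == "B" & toy$diversity == "water"
toy$values[is_bw] <- toy$values[is_bw] + 1

toy_mod <- lm(values ~ Substrate * diversity, data = toy)

# Coefficient table as a data frame, with the term names as a column
coefs <- as.data.frame(summary(toy_mod)$coefficients)
coefs$term <- rownames(coefs)

# Keep only interaction terms (names contain ":") with p < 0.05
sig_interactions <- coefs[grepl(":", coefs$term) & coefs[["Pr(>|t|)"]] < 0.05, ]
print(sig_interactions[, c("term", "Estimate", "Pr(>|t|)")])
```

Because 31 interaction terms are tested at once in the real model, a multiple-comparison adjustment such as `p.adjust()` may be worth considering before drawing conclusions.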

QUESTION 4

Does the dilution factor change any of these answers?

Let's take a look and make some more models.
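
The ANOVA tables that follow are shown without the code that produced them. Below is a sketch of a call that yields a table of the same shape, assuming a one-way `aov()` of the readings on Dilution within one sample's data frame (the `creek$Dilution` term label in the output suggests this form). The `creek` data here are simulated stand-ins, not the exam data.

```r
# Simulated stand-in for the creek data (illustration only)
set.seed(42)
creek <- data.frame(Dilution = rep(c(0.001, 0.01, 0.1), each = 20))
creek$values <- 0.2 + 3 * creek$Dilution + rnorm(nrow(creek), sd = 0.1)

# Dilution is numeric, so it enters the model as a single 1-df term
creek_aov <- aov(creek$values ~ creek$Dilution)
creek_tab <- summary(creek_aov)[[1]]
print(creek_tab)
```

Writing the call as `aov(values ~ Dilution, data = creek)` fits the same model and is the more idiomatic form; only the term label in the printed table changes.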

##                 Df Sum Sq Mean Sq F value Pr(>F)    
## creek$Dilution   1  16.75  16.748   109.7 <2e-16 ***
## Residuals      862 131.64   0.153                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## soil1$Dilution   1   21.4   21.41   35.13 4.46e-09 ***
## Residuals      862  525.4    0.61                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## soil2$Dilution   1   54.0   53.99   105.8 <2e-16 ***
## Residuals      862  439.8    0.51                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##                      Df Sum Sq Mean Sq F value   Pr(>F)    
## wastewater$Dilution   1  12.74  12.740   47.74 9.44e-12 ***
## Residuals           862 230.01   0.267                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now lets add some predictions to a chosen model

data<- add_predictions(data= data, model = mod5)

p<-ggplot(data, aes(x= Substrate, y= values))+
  geom_point()+ facet_wrap(~diversity)+
  geom_point(aes(y=pred), col= "orange")+
  theme(axis.text.x = element_text(angle = 90))


ggplotly(p)

Less carbon is consumed as the concentration of the soil samples increases.
More carbon is consumed as the concentration of the water samples increases.

QUESTION 5

Do the control samples indicate any contamination?

Since water is the negative control, any BioLog reading other than 0 in a water well would indicate contamination.

## # A tibble: 108 x 9
##    Sample.ID     Rep Well  Dilution Substrate inc_hrs values diversity     pred
##    <fct>       <int> <fct>    <dbl> <fct>     <chr>    <dbl> <chr>        <dbl>
##  1 Clear_Creek     1 A1       0.001 Water     Hr_144       0 water     2.90e-14
##  2 Clear_Creek     1 A1       0.001 Water     Hr_48        0 water     2.90e-14
##  3 Clear_Creek     1 A1       0.001 Water     Hr_24        0 water     2.90e-14
##  4 Clear_Creek     1 A1       0.01  Water     Hr_144       0 water     2.90e-14
##  5 Clear_Creek     1 A1       0.01  Water     Hr_48        0 water     2.90e-14
##  6 Clear_Creek     1 A1       0.01  Water     Hr_24        0 water     2.90e-14
##  7 Clear_Creek     1 A1       0.1   Water     Hr_144       0 water     2.90e-14
##  8 Clear_Creek     1 A1       0.1   Water     Hr_48        0 water     2.90e-14
##  9 Clear_Creek     1 A1       0.1   Water     Hr_24        0 water     2.90e-14
## 10 Clear_Creek     2 A1       0.001 Water     Hr_144       0 water     2.90e-14
## # … with 98 more rows

Looking at the values for Hr_24, Hr_48, and Hr_144, we can see that they are all 0, which tells us that there was no contamination.
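
Since the printed tibble shows only the first 10 of 108 control rows, a programmatic check is safer than reading the output by eye. Below is a minimal sketch on made-up long-format rows (the `toy` data frame is hypothetical); with the real data, the same filter on `Substrate == "Water"` would apply.

```r
library(dplyr)

# Made-up long-format rows (illustration only); Water is the negative control
toy <- data.frame(
  Sample.ID = c("Soil_1", "Soil_1", "Clear_Creek", "Clear_Creek"),
  Substrate = c("Water", "L-Serine", "Water", "L-Serine"),
  values    = c(0, 0.58, 0, 0.12)
)

controls <- filter(toy, Substrate == "Water")

# Count nonzero control readings per sample; a count above 0 would flag contamination
control_check <- controls %>%
  group_by(Sample.ID) %>%
  summarise(n_nonzero = sum(values != 0), .groups = "drop")

# TRUE only if every control reading is exactly 0
no_contamination <- all(controls$values == 0)
print(control_check)
print(no_contamination)
```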